3.3.5. Better Plots
3.3.5.1. The theory of good visuals
There is an enormous amount of scholarship and debate about what makes for effective graphs, and I can’t possibly do the field justice. Below is simply one person’s distillation of some tips that are reasonably well agreed upon. I’m aiming to be concise here so that we can practice, but if you want more, visit the links below and the links in the last lecture.
Avoid:
pie charts: humans stink at interpreting angles
stacked bar charts: tough to decode trends
making your reader do math: if \(x-y\) is interesting, don’t plot \(x\) and \(y\) separately, just plot \(x-y\)
misleading scales
3D, unless absolutely necessary (and it almost surely isn’t)
distracting chart junk
unnecessary colors
spaghetti charts: too many lines
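To make the \(x-y\) tip concrete, here is a minimal sketch using made-up data (the series and their values are assumptions for illustration only). The left panel forces the reader to subtract by eye; the right panel plots the difference directly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for illustration: two series whose gap is the story
rng = np.random.default_rng(0)
years = np.arange(2000, 2021)
x = 50 + 1.5 * (years - 2000) + rng.normal(0, 1, years.size)
y = 48 + 1.0 * (years - 2000) + rng.normal(0, 1, years.size)
gap = x - y  # the quantity the reader actually cares about

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(9, 3))

# Left: the reader must subtract one line from the other by eye
ax_left.plot(years, x, label="x")
ax_left.plot(years, y, label="y")
ax_left.set_title("x and y plotted separately")
ax_left.legend()

# Right: the difference is plotted directly, with a zero reference line
ax_right.plot(years, gap)
ax_right.axhline(0, linewidth=0.8, color="gray")
ax_right.set_title("x - y plotted directly")

fig.tight_layout()
```

In a notebook, the figure displays on its own; the `Agg` line is only there so the sketch also runs as a plain script.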

Do:
Show the data, reduce the clutter, and integrate the text and the graph
Graphs should aspire to be understandable on their own, without reading the text
Control the aspect ratio
Think about whether you need to include zero. Sometimes excluding it makes the figure misleading. Sometimes including it (and expanding the y-axis to do so) hides the variation you’re describing.
Facilitate comparisons:
by placing figure components next to or above (it depends!) whatever they are compared to
by using the same axis (two y-axes is usually bad!)
labels > legends! (so readers’ eyes don’t have to dart back and forth)
sort in meaningful orders (e.g., by value, not alphabetically!)
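Two of the tips above, sorting in a meaningful order and preferring labels to legends, can be sketched in a few lines of `matplotlib`. The apple volumes below are hypothetical numbers chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
sales = pd.Series({"Fuji": 42, "Gala": 58, "Granny Smith": 35, "Honeycrisp": 71})
x = np.linspace(0, 2, 50)
curves = {"linear": x, "quadratic": x**2, "cubic": x**3}

fig, (ax_bar, ax_line) = plt.subplots(1, 2, figsize=(9, 3.5))

# Sort by value, not alphabetically, so the ranking is immediate
ordered = sales.sort_values()
ax_bar.barh(ordered.index, ordered.values)
ax_bar.set_xlabel("Volume picked")

# Label each line at its right end instead of using a legend
for name, y in curves.items():
    ax_line.plot(x, y)
    ax_line.text(x[-1] + 0.03, y[-1], name, va="center")
ax_line.set_xlim(0, 2.5)  # leave room for the labels

# Reduce clutter: drop the top and right borders
for side in ("top", "right"):
    ax_line.spines[side].set_visible(False)

fig.tight_layout()
```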
3.3.5.2. Transforming bad figures to good ones

Look at the before/after examples here. This article is also wonderful for understanding the “why”s of good data viz
3.3.5.2.1. Customizing figure aspects
Create your plot in pandas or seaborn
Format the figure as much as possible from within the pandas or seaborn function. I have some info on that below.
If/when necessary, use `matplotlib` to customize the figure.
After you create a figure object, subsequent calls to that object will modify it
Copy the code below into a Python file and run it. Then uncomment the next commented line and rerun to see the change it made. Then uncomment the line after that, rerun, and so on.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 2, 100)
plt.plot(x, x, label='linear') # creates plt obj
# plt.plot(x, x**2, label='quadratic') # adds another plot on top
# plt.plot(x, x**3, label='cubic') # again
# plt.xlabel('x label')
# plt.ylabel('y label')
# plt.title("Simple Plot")
# plt.legend()
# plt.show()
For changes outside the pd and sns plot functions: Honestly, I can’t do better than this page.
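As a rough sketch of the workflow described earlier in this subsection (create the plot in pandas, format what you can inside that call, then adjust the returned object with `matplotlib`), consider the following. The data and label text are placeholders that mirror the simple example above, not anything specific to a real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x = np.linspace(0, 2, 100)
df = pd.DataFrame({"linear": x, "quadratic": x**2, "cubic": x**3}, index=x)

# Create the plot in pandas and format what you can inside that call
ax = df.plot(title="Simple Plot", figsize=(6, 4))

# pandas returns a matplotlib Axes; later calls on it modify the same figure
ax.set_xlabel("x label")
ax.set_ylabel("y label")
ax.axhline(1, linestyle="--", linewidth=0.8, color="gray")  # reference line
ax.legend(frameon=False)
```

Seaborn's axes-level functions (for example `sns.scatterplot`) also return a `matplotlib` Axes, so the same pattern should carry over.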
```{dropdown} Formatting plots in pandas
todo
```
```{dropdown} Formatting plots in `seaborn`
todo
```
3.3.5.3. Practice: Thinking and planning
Questions: For Q1-Q3, which type of graph (bar, line, or histogram) would you use?
The volume of apples picked at an orchard based on the type of apple (Granny Smith, Fuji, etcetera).
The number of points for each game in a basketball season for a team.
The count of apartment buildings in Chicago by the number of individual units.
Suppose we create a scatter plot but find that due to the large number of points it’s hard to interpret. What are two things we can do to fix this issue?
Suppose that we create an n-by-n FacetGrid. How big can “n” get?
What are the two things about faceting which make it appealing?
When is `sns.pairplot` most useful?
**Answers**
Q1
This is a nominal categorical example, and hence, a pretty straightforward bar graph target.
Q2
This is a (nearly) continuous variable, with 82 observations (games). 82 bars is too many for a bar chart, but a line chart, histogram (or density plot), or boxplot would all work.
Q3
A density chart would work, but you could also use a histogram as long as you “bin” apartment buildings (<10 units, 10-50 units, etc.). Note that this variable will be skewed because only a few buildings have 500+ units.
Q4
One way to fix this issue would be to sample the points. Another way to fix it would be to use a hex plot.
Q5
A matter of size and visibility, but 5x5 is probably as large as you want to go.
Q6
It’s an easy way to add information about additional variables of interest to a figure.
Q7
Especially useful when you’re exploring the dataset.
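The two fixes named in the answer to Q4 (sampling the points, or switching to a hex plot) can be sketched as follows. The data are simulated purely for illustration, and the sample size and `gridsize` are arbitrary choices rather than recommendations.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated data: 50,000 correlated points, too many for a plain scatter
rng = np.random.default_rng(0)
n = 50_000
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 0.5 * df["x"] + rng.normal(size=n)

fig, (ax_sample, ax_hex) = plt.subplots(1, 2, figsize=(9, 4))

# Fix 1: plot a random sample, with small, semi-transparent markers
subset = df.sample(n=2_000, random_state=0)
ax_sample.scatter(subset["x"], subset["y"], s=5, alpha=0.4)
ax_sample.set_title("Random sample of 2,000 points")

# Fix 2: bin all the points into hexagons and color each by its count
hb = ax_hex.hexbin(df["x"], df["y"], gridsize=40, cmap="Blues")
fig.colorbar(hb, ax=ax_hex, label="points per hexagon")
ax_hex.set_title("Hex plot of all points")

fig.tight_layout()
```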
3.3.5.4. Interactive plots: plotly
I want to show you how far we can push this to explore leverage and firm value. The code uses plotly’s subpackage `plotly.express`, which is ridiculously easy to use given how cool these plots are.
And as an exercise, you might critique these - I certainly think there are aspects to improve!
#!pip install plotly
%matplotlib inline
import pandas as pd
import numpy as np
import plotly.express as px # pip install plotly.. the animation below is from plotly module
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
url = 'https://github.com/LeDataSciFi/ledatascifi-2021/blob/main/data/CCM_cleaned_for_class.zip?raw=true'
#firms = pd.read_stata(url) <-- would work, but GH said "too big" and forced me to zip it,
# so here is the work around to download it:
with urlopen(url) as request:
    data = BytesIO(request.read())
with ZipFile(data) as archive:
    with archive.open(archive.namelist()[0]) as stata:
        firms = pd.read_stata(stata)
# firms = pd.read_stata('https://github.com/LeDataSciFi/ledatascifi-2021/blob/main/data/CCM_cleaned_for_class.zip?raw=true')
firms.name = "Firms"
# https://jupyterbook.org/guide/05_faq.html#How-can-I-include-interactive-Plotly-figures?
# the lines before and after the fig help make sure this is viewable on the website
# but shouldn't be necessary just for notebook viewing... but I'm not sure about github viewing
from IPython.display import display, HTML  # the public import path; IPython.core.display is deprecated for this
from plotly.offline import init_notebook_mode, plot
init_notebook_mode(connected=True)
fig = (
    firms
    .query('(fyear < 2014) & (mb < 5) & (td_a >= 0) & (td_a < 1.5) ') # some sensible limits
    .groupby(['state','gsector','fyear'])
    .agg({'td_a':'mean','mb':'mean','at':'sum','lpermno':'count'})
    # we need the # of firms per industry-state for an extra filter
    # and I wanted the total assets summed so bigger industries get bigger circles
    .rename(columns={'td_a':'Avg Book Leverage', 'mb':'Avg Market to Book','lpermno':'Num_Firms'})
    .query('Num_Firms > 20 ') # discard small industry-states
    .reset_index() # get fyear as a variable for plotting function
    .pipe(
        (px.scatter,'data_frame'),
        y='Avg Market to Book', x='Avg Book Leverage', animation_frame="fyear",
        range_x=[0,.5], range_y=[0,2], hover_data=["state","gsector"],
        title = "State-By-Industry Avg Leverage and Avg Firm Value"
    )
)
plot(fig, filename = 'ind-state mb vs lev.html')
display(HTML('ind-state mb vs lev.html'))
fig = (
    firms
    .query('(fyear < 2014) & (mb < 5) & (td_a >= 0) & (td_a < 1.5) ') # some sensible limits
    .query('state in ["CA","NY"] & gsector in ["40","45"]') # sample restriction
    .rename(columns={'td_a':'Book Leverage'})
    .reset_index() # get fyear as a variable for plotting function
    .pipe(
        (px.scatter,'data_frame'),
        y='mb',x='Book Leverage',animation_frame="fyear",
        range_x=[0,1.5], range_y=[0,5],
        facet_row="gsector", facet_col="state",
        hover_data=["state","gsector"],
        title = "Leverage and Firm Value"
    )
)
plot(fig, filename = 'mb vs lev for each state-ind.html')
display(HTML('mb vs lev for each state-ind.html'))
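A note on the `.pipe((px.scatter,'data_frame'), ...)` pattern used in both figures: pandas' `pipe` accepts a `(callable, keyword)` tuple and passes the DataFrame to the callable under that keyword name. The sketch below shows the mechanics with a hypothetical stand-in function (`describe_frame` is invented for illustration and is not part of plotly or pandas) and a tiny made-up DataFrame.

```python
import pandas as pd

def describe_frame(title, data_frame=None):
    """Hypothetical stand-in for a function like px.scatter, which takes
    the DataFrame through a keyword argument named data_frame."""
    return f"{title}: {data_frame.shape[0]} rows, {data_frame.shape[1]} columns"

# Made-up data for illustration
df = pd.DataFrame({"fyear": [2010, 2011, 2012, 2013], "mb": [1.2, 0.9, 6.0, 1.5]})

result = (
    df
    .query("mb < 5")
    # (callable, "keyword") tells pipe to pass the frame as that keyword argument
    .pipe((describe_frame, "data_frame"), "Filtered firms")
)
print(result)  # Filtered firms: 3 rows, 2 columns
```

This is what lets a plotting call sit at the end of a method chain without first storing the filtered data in an intermediate variable.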
One more: This is a replication of a famous Hans Rosling TED talk figure using the well-known gapminder data:
fig = px.scatter(px.data.gapminder(), x="gdpPercap", y="lifeExp",
                 size="pop", color="continent", animation_frame="year",
                 range_y=[30,85],
                 hover_name="country", log_x=True, size_max=60)
plot(fig, filename = 'hans.html')
display(HTML('hans.html'))